Building a knowledge graph to enable precision medicine

Developing personalized diagnostic strategies and targeted treatments requires a deep understanding of disease biology and the ability to dissect the relationship between molecular and genetic factors and their phenotypic consequences. However, such knowledge is fragmented across publications, non-standardized repositories, and evolving ontologies describing various scales of biological organization between genotypes and clinical phenotypes. Here, we present PrimeKG, a multimodal knowledge graph for precision medicine analyses. PrimeKG integrates 20 high-quality resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved drugs with their therapeutic action, considerably expanding previous efforts in disease-rooted knowledge graphs. PrimeKG contains an abundance of ‘indications’, ‘contradictions’, and ‘off-label use’ drug-disease edges that lack in other knowledge graphs and can support AI analyses of how drugs affect disease-associated networks. We supplement PrimeKG’s graph structure with language descriptions of clinical guidelines to enable multimodal analyses and provide instructions for continual updates of PrimeKG as new data become available.


Background & Summary
Precision medicine takes an approach to disease diagnosis and treatment that accounts for the variability in genetics, environment, and lifestyle across individuals 1 . To be precise, medicine must revolve around data and learn from biomedical knowledge and health information 2 . Nevertheless, many barriers to efficiently exploiting information across biological scales slow down the research and development of individualized care 2 . While many have acknowledged the difficulties in linking biomedical knowledge to patient-level health information [2][3][4][5] , few realize that biomedical knowledge is itself fragmented. Biomedical knowledge about complex diseases comes from different organizational scales, including genomics, transcriptomics, proteomics, molecular functions, intra-and inter-cellular pathways, phenotypes, therapeutics, and environmental effects. For any given disease, information from these organizational scales is scattered across publications, non-standardized data repositories, evolving ontologies, and clinical guidelines. Developing networked relationships between these sources can support research in precision medicine.
A resource that comprehensively describes the relationships of diseases to biomedical entities would enable systematic study of human disease. Understanding the connections between diseases, drugs, phenotypes, and other entities could open the doors for many types of research, including but not limited to the study of phenotyping [6][7][8] , disease etiology 9 , disease similarity 10 , diagnosis [11][12][13] , treatments 14 , drug-disease relationships [15][16][17] , mechanisms of drug action 18 and resistance 3 , drug repurposing [19][20][21] , drug discovery 22,23 , adverse events 24,25 , and combination therapies 26 . Knowledge graphs developed for individual diseases have yielded insights into respective disease areas [27][28][29][30][31][32][33][34][35][36][37][38][39][40][41][42] . Nevertheless, the costs and extended timelines of these individual efforts point to a need for a resource that would unify biomedical knowledge and enable the investigation of diseases at scale. While many primary data resources contain information about diseases, consolidating them into a comprehensive, disease-rich, and functional knowledge graph presents three challenges. First, existing approaches to network analysis of diseases require expert review and curation of data in the knowledge graph 29,30,43 . While incredibly detailed, such efforts require substantial manual labor and expensive expert input, making them difficult to scale. Second, there lacks a consistent representation of diseases across biomedical datasets and

Methods
We proceed with a detailed description of the 20 primary data resources used to build PrimeKG. Table 1 lists primary data resources organized by node types in PrimeKG. Since many resources contain information about multiple types of nodes in PrimeKG, we ordered these resources alphabetically in the following description. a. overview of primary data resources. To develop a comprehensive knowledge graph to study diseases, we considered 20 primary resources and a number of additional repositories of biological and clinical information. Figure 2a provides an overview of all 20 resources. The data resources provide widespread coverage of biomedical entities, including proteins, genes, drugs, diseases, anatomy, biological processes, cellular components, molecular functions, exposures, disease phenotypes and drug side effects. These were high-quality datasets, either expertly curated annotations such as the Disease Gene Network (DisGeNet) of gene-disease associations 78 and the Mayo Clinic knowledgebase 55 , widely-used standardized ontologies such as the MONDO Disease Ontology 44 , or direct readouts of experimental measurements such as Bgee gene expression knowledgebase 79 and DrugBank 80 . A complete list of primary resources and the processing steps are listed in the Methods section.
Bgee gene expression knowledge base in animals. Bgee 79 contains gene expression patterns across multiple animal species. We retrieved gene expression data for humans from ftp://ftp.bgee.org/current/download/calls/ expr_calls/Homo_sapiens_expr_advanced.tsv.gz on 31 May 2021. First, we ensured that all anatomical entities were coded using the UBERON ontology. Then, we filtered the dataset to retain gold quality calls with a false discovery rate corrected p-value of ≤ 0.01. Finally, we filtered the expression rank column to extract highly expressed genes. Based on the range and distribution of the expression rank, we retained all data with an expression rank less than 25,000. After processing, we had 1,786,311 anatomy-protein associations where gene expression was found to be present or absent.
Comparative toxicogenomics database. The Comparative Toxicogenomics Database (CTD) 81 is focused on the impact of environmental exposures on human health. We retrieved information about exposures (05/21 version) from http://ctdbase.org/reports/CTD_exposure_events.csv.gz on 9 Jun 2021. Processing involved removing header comments from the raw file. After processing, our data contained 180,976 associations of exposures with exposures, proteins, diseases, biological processes, molecular functions, and cellular components.
DisGeNET knowledgebase of gene-disease associations. DisGeNET 78 is a resource about the relationships between genes and human disease that has been curated by experts. We retrieved curated disease-gene associations (version 7.0) from https://www.disgenet.org/static/disgenet_ap1/files/downloads/curated_gene_disease_ associations.tsv.gz on 31 May 2021. The raw data file, 'curated_gene_disease_associations.tsv' was not processed further and contained 84,038 associations of genes with diseases and phenotypes.
Disease ontology. Disease Ontology 47 groups diseases in many meaningful clusters using clinically relevant characteristics. For instance, diseases are grouped by the anatomical entity. We retrieved the ontology from https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/main/src/ontology/HumanDO. obo on 29 Jun 2021. The raw data 'HumanDO.obo' is mapped to disease nodes in PrimeKG. As the MONDO Fig. 1 Overview of PrimeKG multimodal knowledge graph. (a) Shown is a schematic overview of the various types of nodes in PrimeKG and the relationships they have with other nodes in the graph. (b) All disease nodes in PrimeKG shown in a circular layout together with disease-associated information. All relationships between disease nodes and any other node type are depicted here. Disease nodes are densely connected to four other node types in PrimeKG through seven types of relations. (c) Shown is an example of paths in PrimeKG between the disease node ' Autism' and the drug node 'Risperidone' . Intermediate nodes are colored by their node type from panel a. We also display snippets of text features for both nodes to demonstrate the multimodality of PrimeKG. Abbreviations -MF: molecular function, BP: biological process, CC: cellular component, APZ: Apiprazole, EPI: epilepsy, ABP: abdominal pain, +/− associations: positive and negative associations.
Human phenotype ontology. The Human Phenotype Ontology 45 (version hpo-obo@2021-04-13) provides information on phenotypic abnormalities found in diseases. We retrieved the ontology from http://purl.obolibrary.org/obo/hp.obo on 31 May 2021. Processing involved parsing the ontology file to extract phenotype terms in the ontology, parent-child relationships, and cross-references to other ontologies. The processed data contains disease-phenotype, protein-phenotype, and phenotype-phenotype edges. We also retrieved expertly curated annotations from http://purl.obolibrary.org/obo/hp/hpoa/phenotype.hpoa on 31 May 2021. From this curated source, we extracted 218,128 positive and negative associations between diseases and phenotypes. www.nature.com/scientificdata www.nature.com/scientificdata/ Mayo clinic. Mayo Clinic is a nonprofit academic medical center and biomedical research institution care 55 . It maintains a knowledgebase, https://www.mayoclinic.org/diseases-conditions, with information about symptoms, causes, risk factors, complications, and prevention of 2,227 diseases and conditions. We web-scraped the knowledgebase and extracted descriptions for these diseases and conditions using the mayo.py and diseases.py scripts on 28 March 2021. The raw data is available at 'mayo.csv' in the PrimeKG repository.
For example, we illustrate the extracted features for atrial fibrillation from Mayo Clinic. 'Some people with atrial fibrillation have no symptoms […] others may experience signs and symptoms such as Palpitations, Weakness, […] and Chest Pain. The disease occurs when 'the two upper chambers of your heart experience chaotic electrical signals […] As a result, they quiver. The AV node is bombarded with impulses to get through to the ventricles. Certain factors may increase your risk of developing atrial fibrillation, including age, heart disease, […] and obesity. Complications include: 'the chaotic rhythm causing blood to pool in your atria and form clots […] leading to a stroke. […] Atrial fibrillation, especially if not controlled, may weaken the heart and lead to heart failure. To prevent atrial fibrillation, it's important to live a heart-healthy lifestyle […] which may include increasing your physical activity […]. These snippets represent only an overview of over three pages of descriptive features available on atrial fibrillation.  which data records are used to uniquely identify each node type. For example, GO is colored by biological processes, cellular components, and molecular functions because GO terms are the unique identifiers used to define nodes for these three node types. (b) Primary resources are colored by each node type for which they possess information. For example, GO provides links from biological processes, cellular components, and molecular functions to genes. As a result, we add the fourth color to represent the gene/protein class. (c) Illustrated is the process of harmonizing these primary data records to extract relationships between node types. (d) The left side illustrates PrimeKG, and the right side shows all the textual sources of clinical information on drugs and diseases. The node type legend is consistent across the figure. Abbreviations -MF: molecular function, BP: biological process, CC: cellular component, PPI: protein-protein interactions, DO: disease ontology, MONDO: MONDO disease ontology, Entrez: Entrez gene, GO: gene ontology, UMLS: unified medical language system, HPO: human phenotype ontology, CTD: comparative toxicogenomics database, SIDER: side effect resource.
www.nature.com/scientificdata www.nature.com/scientificdata/ International Classification of Diseases (ICD), and Medical Dictionary for Regulatory Activities (MedDRA), it was our preferred ontology for defining diseases. We retrieved the ontology from http://purl.obolibrary.org/ obo/MONDO.obo on 31 May 2021. Processing involved parsing the ontology file to extract disease terms in the ontology, parent-child relationships, subsets of diseases, cross references to other ontologies, and definitions of disease terms. The processed data contains 64,388 disease-disease edges.
Orphanet. Orphanet 48 is a database that focuses on gathering knowledge about rare diseases. The Orphanet resource at https://www.orpha.net/consor/cgi-bin/Disease_Search_List.php?lng=EN has curated information about definitions, prevalence, management and treatment, epidemiology, and clinical description for 9,348 rare diseases. We retrieved the resource data and extracted disease features on 10 May 2021 using the orpha.py script available in the PrimeKG repository.
Let us illustrate features in PrimeKG for rare Hurler syndrome with the Orphanet ID 93473. Hurler syndrome is the most severe form of mucopolysaccharidosis type 1, a rare lysosomal storage disease characterized by skeletal abnormalities, cognitive impairment, heart disease, Four sources of physical protein-protein interactions. Protein-protein interactions (PPIs) are composed of experimentally-verified interactions between proteins. The interactions we consider are diverse in nature and include signaling, regulatory, metabolic-pathway, kinase-substrate and protein complex interactions, which are considered unweighted and undirected. We use the human PPI network compiled by Menche et al. 87 as the starting resource. This resource integrates several protein-protein interaction databases, including TRANSFAC for regulatory interactions 88 , MINT and IntAct for yeast to hybrid binary interactions 89,90 , and CORUM for protein complex interactions 91 . Additionally, we retrieve protein-protein interaction information from from BioGRID 92 and STRING 93 databases. We also consider the human reference interactome (HuRI) generated by Luck et al. 94 , specifically, we use the high throughput investigation (HI) union, a combination of HuRI and several related efforts to systematically screen for protein-protein interactions. The processed data contains 642,150 edges.
Reactome pathway database. Reactome 95 is an open-source, curated database for pathways. We retrieved information about pathways from https://reactome.org/download/current/ReactomePathways.txt, relationships between pathways from https://reactome.org/download/current/ReactomePathwaysRelation.txt and pathway-protein relations from https://reactome.org/download/current/NCBI2Reactome.txt on 31 May 2021. Processing involved extracting ontology information such as hierarchical relationships and extracting pathway-protein interactions. The processed data contains 5,070 pathway-pathway and 85,292 protein-pathway edges.
Side effect knowledgebases. The Side Effect Resource (SIDER) 96 contains data about adverse drug reactions. We retrieved side-effect data (SIDER 4.1 version) from http://sideeffects.embl.de/media/download/meddra_ all_se.tsv.gz and SIDER's drug to Anatomical Therapeutic Chemical (ATC) classification mapping from http:// sideeffects.embl.de/media/download/drug_atc.tsv on 31 May 2021. Processing involved extracting all side effects where the MedDRA term was coded at the preferred term level and then mapping drugs from STITCH identifiers 97 to ATC identifiers. The processed data 202,736 contains drug-phenotype associations.
Uberon multi-species anatomy ontology. Uberon 98 is an ontology that contains information about the human anatomy. We retrieved the ontology from http://purl.obolibrary.org/obo/uberon/ext.obo on 31 May 2021. Processing involved extracting information about anatomy nodes and the relationships between them. The processed data contains 28,064 hierarchical relationships between anatomy nodes. UMLS knowledgebase. The Unified Medical Language System (UMLS) Knowledge Source 46 contains information about biomedical and health-related concepts. We retrieved the complete UMLS Metathesauras from https://download.nlm.nih.gov/umls/kss/2021AA/umls-2021AA-metathesaurus.zip on 31 May 2021 in '.RRF' format. To map UMLS CUI terms to the MONDO Disease Ontology, we used the 'MRCONSO.RRF' to extract UMLS Concept Unique Identifier (CUI) terms in English. We mapped UMLS CUI terms to MONDO terms in two ways. Firstly, we directly extracted cross-references between the two from the MONDO ontology. We indirectly mapped UMLS to MONDO using OMIM, National Cancer Institute Thesaurus (NCIT), MESH, MedDRA, ICD 10, and SNOMED CT as intermediate ontologies.
Further, we used 'MRSTY.RRF' and 'MRDEF.RRF' files to extract definitions for UMLS terms. Of the 127 semantic types in the 'MRSTY.RRF' file, we selected 11 that belonged to the Disorder semantic group in a manner consistent with prior work 99 . These semantic types were congenital abnormality, acquired abnormality, Injury or poisoning, pathologic function, disease or syndrome, mental or behavioral dysfunction, cell or molecular dysfunction, experimental model of disease, signs and symptoms, anatomical abnormality, and neoplastic process. We then used the 'MRDEF.RRF' file to extract definitions for CUI terms from sources that were in English.

B. Standardizing and harmonizing data resources. To harmonize these primary resources into
PrimeKG, we selected ontologies for each node type, harmonized datasets into a standardized format, and resolved overlap across ontologies. The process of defining node types and selecting common ontologies is illustrated in Fig. 2a where primary data records are colored if they are used to define unique identifiers for a node type. In the remainder of this study, we interchangeably refer to 'gene/protein' nodes as proteins and 'effect/ phenotype' nodes as phenotypes. We mapped the aforementioned processed resources to ensure that all nodes were defined using unique identifiers from their respective ontologies and databases. Next, we identified sources of information across different primary resources for each node type to maximize the number of relationships in PrimeKG (Fig. 2b).
Resolving overlap between phenotype and disease nodes. Since both the MONDO Disease Ontology 44 and Human Phenotype Ontology 45 were developed by the Monarch Initiative, there exists a considerable overlap between phenotype nodes and disease nodes across the various datasets. Overlapping nodes were defined as effect/phenotype nodes in HPO that (i) had the same ID number as disease nodes in MONDO and (ii) could be mapped from HPO to MONDO using cross-references found in MONDO. These overlapping phenotype nodes were converted to disease nodes by manipulating edges in various datasets to avoid duplicate nodes. Let us define the set of overlapping phenotype nodes as P. Phenotype-phenotype edges extracted from HPO were converted to phenotype-disease edges if one phenotype node was in P and to disease-disease edges if both phenotype nodes were in P. Protein-phenotype edges extracted from DisGeNet 78 were converted to protein-disease relations if the phenotype node was in P and removed from the group of protein-phenotype edges. Finally, for disease-phenotype and drug-phenotype relations, we dropped any edges where the phenotype was in P. Adding these edges to drug-disease relations would only introduce unnecessary noise to the indication, contraindication, and off-label use edges.

C. Building precision medicine knowledge graph (PrimeKG).
To create PrimeKG's graph, we merged the harmonized primary data resources into a graph and extracted its largest connected component as shown in Fig. 2c. We integrated the various processed, curated datasets and cleaned the graph by dropping NaN and duplicate edges, adding reverse edges, dropping duplicates again, and removing self-loops. This version of the knowledge graph is available in PrimeKG's repository as 'kg_raw.csv' . To ensure that PrimeKG is well-connected and has no isolated pockets, we extracted its largest connected component using the iGraph package 100 (Fig. 2d). Features from DrugBank mapped directly to the knowledge graph since drugs were coded using DrugBank identifiers. Some features had unique attributes for each drug, such as 'state' , 'indication' , and 'mechanism of action' , and others had numerous attributes for each drug, such as 'group' and ' Anatomical Therapeutic Chemical (ATC) classification level' . The latter set of features was converted to single text descriptions by joining features using conjunctions such as ';' and 'and' . Features in Drug Central were mapped to DrugBank IDs using their Chemical Abstracts Service Registry (CAS) identifiers from the vocabulary retrieved from DrugBank. Once all features were mapped, text processing removed all tokens referenced in DrugBank (for example, "[L64839]") with the help of regular expressions. We nullified locations where the text mentioned that no data was available for the half-life feature. Finally, we converted numerical features into textual descriptions in order to standardize the feature set.
We select the drug Prednisolone as an example to illustrate the depth of clinical information available in these features. The '[…]' is used to compress text sections for brevity. Prednisolone is a glucocorticoid similar to cortisol used for its anti-inflammatory, immunosuppressive, anti-neoplastic, and vasoconstrictive effects. Prednisolone has a plasma half-life of 2.  46 (Fig. 2d). Features from all these sources were mapped to the 'node_id' field of disease nodes, which was defined using the MONDO Disease Ontology. Since disease nodes were grouped as described in the Technical Validation section, many diseases defined in the MONDO Disease Ontology (i.e., many 'node_id' values) were collapsed into a single node (i.e., unique 'node_index' values). Since disease features are mapped to MONDO identifiers or the 'node_id' field, it is possible for a single disease node in the knowledge graph, defined by a unique 'node_index' , to have multiple feature values for a given feature. PrimeKG provides these features in their entirety.
www.nature.com/scientificdata www.nature.com/scientificdata/ Disease definitions from the MONDO Disease Ontology were directly extracted from the ontology file and unique for each 'node_id' . Disease descriptions extracted from UMLS were mapped from Concept Unique Identifier (CUI) terms to MONDO and, as a result, numerous for each 'node_id' . Using regular expressions, we removed tokens that were references and URLs from UMLS disease descriptions. From Orphanet, we extracted definitions, prevalence, epidemiology, clinical description, and management and treatment. We mapped the features from Orphanet IDs to MONDO, and as a result, there were multiple for each 'node_id' . We used regular expressions to fix formatting errors in the prevalence and epidemiology features.
We extracted the following disease features from the Mayo Clinic's knowledgebase: symptoms, causes, risk factors, complications, and prevention. Since the Mayo Clinic web-scrapping did not provide a unique identifier in any ontology, we mapped disease names in Mayo Clinic to those in the MONDO Disease Ontology. To develop this mapping, we used a strategy for grouping disease names described in detail in the Technical Validation section. Briefly, we conducted automated string matching followed by manual approval of all disease name mappings based on their Bidirectional Encoder Representations from Transformers (BERT) model 101 embedding similarity. Automated string matching involved approving exact matches and encapsulating matches, where the name in Mayo was completely present in the name in MONDO. During processing the symptoms feature, we used regular expressions to extract the end of the text description that explained when to see the doctor as a new and separate feature. Finally, we fixed formatting errors in the text.
To illustrate the depth and breadth of information covered by the disease features, we select Hepatic Porphyria. The '[…]' notation is used to compress sections of text for brevity. Per the MONDO Disease Ontology, Hepatic Porphyria is a group of metabolic diseases due to deficiency of one of a number of liver enzymes in the biosynthetic pathway of heme. They are characterized by […]. Clinical features include […]. The UMLS has a very similar disease description. According to Orphanet, it's a rare sub-group of porphyrias characterized by neuro-visceral attacks with […]. In most European countries, the prevalence of acute hepatic porphyrias is around 1/75000. In 80% of cases, the patients are female. All acute hepatic porphyrias can be accompanied by neuro-visceral attacks that appear as

Data record
The Precision Medicine Knowledge Graph (PrimeKG) is available at Harvard Dataverse 102 . To develop PrimeKG, we retrieved and collated 20 data resources (detailed in the Methods section) as visualized in Fig. 2a, identified relations across these resources as shown in Fig. 2b,c, harmonized them into a heterogeneous network illustrated in Fig. 2c, and augmented the drug and disease nodes in the network with clinical features depicted in Fig. 2d. Language descriptions of drugs and clinical characteristics of diseases give the features of drug or disease nodes.
PrimeKG is a multimodal knowledge graph with 10 types of nodes, 30 types of undirected edges, and natural language descriptions for disease and drug nodes. PrimeKG contains 129,375 nodes and 4,050,249 edges. Figure 1a shows a schematic overview of the graph structure. We provide a breakdown of the number of nodes by node type and the number of edges by edge type in Tables 1, 2, respectively.  Tables 3, 4 show statistics on the number of features available for drug and disease nodes. Disease features include the disease prevalence information, symptoms, causes, risk factors, epidemiology, clinical description, management and treatment, complications, prevention, and when to see a doctor. Drug features include molecular weight of chemical compounds, indications, mechanisms of action, pharmacodynamics, protein binding events, and pathway information. Figure 1c provides an example of the supporting information available across these features.
The PrimeKG knowledge graph has nodes of 10 types and uses the following terminologies and ontologies to describe the nodes. The node types 'drug' , 'disease' , 'anatomy' and 'pathway' are respectively encoded as terms in DrugBank 80 , MONDO 44 , UBERON multi-species anatomy ontology 98 , and Reactome pathway database 95 . Genes and proteins are treated as a single node type, 'gene/protein' , and identified by Entrez Gene IDs 84 . The node types 'biological process' , 'molecular function' , and 'cellular component' are defined using Gene Ontology (GO) terms 86 . Disease phenotypes extracted from Human Phenotype Ontology (HPO) 45 and drug side effects extracted from Side Effect Knowledgebase (SIDER) 96 are collapsed into a single node type, 'effect/phenotype, ' that is encoded using HPO IDs. Finally, 'exposure' nodes are defined using the ExposureStressorID field, which contains Medical Subject Headings (MeSH) provided by the Comparative Toxicogenomics Database (CTD) 81 .
The datasets in PrimeKG are structured to follow a standardized format. In the following, quotations indicate column names used to define PrimeKG. For each node in the knowledge graph, we provide 'node_index' , which is a unique index to identify the node in PrimeKG; 'node_id' , which indicates the identifier of the node from its ontology; 'node_type' , which indicates the node type (Table 1); 'node_name' which indicates the name of the node as provided by the ontology; and 'node_source' which indicates the ontology from which 'node_id' and 'node_name' fields were extracted. For each edge in PrimeKG, we provide 'relation' , which is the name of the edge type that connects the two nodes (Table 2); 'x_index' , which links to the 'node_index' field; and 'y_index' , which also links to 'node_index' . (  www.nature.com/scientificdata www.nature.com/scientificdata/

technical Validation
As part of the technical validation, we explore the structure and connectivity of PrimeKG.
Distinguishing properties of PrimeKG. Here, we highlight four distinguishing properties of PrimeKG.
We provide evidence and statistical support for these claims about PrimeKG in juxtaposition with three seminal biomedical knowledge graphs SPOKE 38 , HSDN 65 , and GARD 34 . Firstly, PrimeKG provides coverage for an extensive range of diseases. Our knowledge graph comprises 22,236 disease terms that are then grouped into 17,080 clinically meaningful diseases. Compared to existing graphs such as SPOKE, HSDN, and GARD, PrimeKG provides an improvement in disease coverage by one to two orders of magnitude. Next, while these existing resources primarily contain "treats" or indication edges between drugs and diseases, PrimeKG provides more intricate relationships with indication, contraindication, and off-label use edges.
Moreover, PrimeKG provides incredible coverage of rare diseases while remaining integrated with the entire range of diseases. Orphanet 48 is considered the definitive authority on rare diseases. Of 9,348 rare diseases in Orphanet, 90.8% are present in PrimeKG as disease nodes. Previously, knowledge graphs have either scarce coverage of rare diseases (e.g., SPOKE, HSDN) or an exclusive focus on them (e.g., GARD). PrimeKG embraces the entire range of conditions from rare to prevalent across its 22,236 disease terms. Finally, PrimeKG is multimodal and contains clinical features for drugs and diseases. Traditionally, knowledge graphs have been defined solely as relationships between their nodes. These graph relationships tend to encode biological and molecular information but lack medically relevant descriptions. PrimeKG integrates clinically meaningful text descriptions for drug and disease nodes. This multimodality of PrimeKG enables the fusion of medical and molecular knowledge.
PrimeKG is easy to use and update. PrimeKG is available as a single set of triplets of source nodes, relations, and target nodes. This data structure is agnostic to computing preferences and can be read using any programming language. We use Harvard Dataverse 102 to make PrimeKG available in a single user-friendly CSV file so that there is no need for the user to connect to any external databases. Although PrimeKG has millions of edges, it can be loaded into memory using a standard CPU in less than 5.18 seconds ± 51.9 milliseconds. Queries on the knowledge graph generally take less than 1 second. Once loaded, PrimeKG can be converted into a graph structure through various commonly used libraries (such as iGraph or NetworkX for Python). We also provide tutorials for getting started and links to community data loaders on our GitHub repository.
The provenance of PrimeKG can be easily tracked on Harvard Dataverse 102 . All our data curation and processing approaches are transparent, fully reproducible, and can be continually adapted as data resources evolve and new data become available. We have provided detailed instructions on "Building an updated PrimeKG" on our PrimeKG GitHub repository. Complete information is given here for (a) downloading primary data records, (b) processing data for each primary record along with script names and expected outputs, and (c) a ready-to-run Jupyter notebook to build an updated PrimeKG. After building an initial version of PrimeKG, users can update individual resources as necessary.
A case study to evaluate the relevance of PrimeKG to the clinical presentation of autism. For downstream inferences made using PrimeKG to be conducive to studying human disease, disease nodes in PrimeKG would need to be medically relevant. To this end, we next analyze if PrimeKG's representation of  Table 4. Statistics on disease features in PrimeKG. Unprocessed KG refers to the initial knowledge graph assembled from datasets. Processed KG refers to the fully processed PrimeKG, and includes disease groupings. The count column refers to the number of features including duplicates, and the Unique column refers to the number of unique features. Note that the 'Combined' row has counts greater than the total number of diseases in the knowledge graph. This is not a discrepancy but rather reflects the complexity of the data involving missingness for some diseases and multiple descriptors for other diseases.
www.nature.com/scientificdata www.nature.com/scientificdata/ diseases strongly relates to their clinical presentation by carrying out a case study on autism spectrum disorder. We were motivated to investigate autism because it not only has incredible clinical heterogeneity [103][104][105] but this heterogeneity has also been studied to identify clinically meaningful subtypes 56,57 . We gauged the relevance of disease nodes related to autism in PrimeKG in two steps: first, by performing the entity resolution for autism concepts across all relevant primary data resources (see Data Record), and second, by examining the relationship between these autism concepts and clinical subtypes of autism.
We start by exploring whether autism disease nodes in PrimeKG reconciled the variation in autism concepts across databases and ontologies. For example, as demonstrated in Fig. 3a, MONDO disease ontology has 37 disease concepts related to autism, whereas the UMLS has 192 autism-associated concepts, and Orphanet has 6 autism-associated concepts. Although it is not immediately clear how these concepts relate to each other, we cannot develop a coherent knowledge graph without establishing connections between them. To this end, we overcome this challenge by defining all nodes using the MONDO disease ontology and mapping all other vocabularies to diseases in MONDO as outlined in Fig. 3a.
Finally, before using MONDO disease concepts as disease nodes in PrimeKG, we need to assess whether autism disease concepts in MONDO correlate with clinical subtypes of autism. Autism has been shown to manifest as three clinical subgroups characterized primarily by seizures, multisystem and gastrointestinal disorders, and psychiatric disorders 57 . However, it was unclear how MONDO's 37 autism disease concepts (Fig. 3a) relate to the three clinically defined subtypes. In addition, there were many disease concepts in autism, such as ' Autism, susceptibility to, 1' , ' Autism, susceptibility to, 2' , ' Autism, susceptibility to, x-linked' , etc., with no apparent clinical meaning, suggesting that disease nodes in MONDO do not correspond one-to-one to clinical manifestation of autism. For this reason, we developed a strategy to group diseases from MONDO into medically relevant and coherent nodes in PrimeKG. We proceed with describing and evaluating that strategy.
Computational approaches to grouping disease nodes. As demonstrated in our case study of autism, disease concepts in MONDO may not correlate well with medical subtypes. MONDO contains many repetitive disease entities with no apparent clinical correlation. For this reason, we were motivated to group diseases in MONDO into medically relevant entities. Ideally, we would have preferred leveraging expertise across various disease areas when grouping these concepts. However, this approach was time-consuming, expensive, and challenging to execute at scale. Further, disease sub-phenotyping is a relatively new paradigm, so we anticipated low consensus among medical experts on what constitutes a unique disease.
Since manually grouping diseases with expert supervision was not feasible, we took a semi-automated unsupervised approach to group disease concepts in PrimeKG. Advances in natural language processing, specifically the Bidirectional Encoder Representations from Transformers (BERT) model 101 , allowed us to study the similarity between disease concept names. We grouped disease concepts with nearly identical names into a single node with string matching and BERT embedding similarity 101,[106][107][108][109] . We identified disease groups using a string-matching strategy across disease names 110 . In this strategy, we selected a disease that ended with a number, a roman numeral, or any alphanumeric phrase with a length of less than 2, or 'type' as the second-last word. Once such a disease was selected, we extracted the primary disease phrase by dropping the ending and used this phrase to find matches. Matches included diseases with the same initial phrase and those containing all phrase words with no other words, regardless of word order. The words 'type' and '(disease)' were ignored for the latter matching criteria. In this manner, we grouped disease concepts in MONDO with string matching.
We further tightened groupings identified using string matching by exploring word embedding similarities between disease names, which is visualized in Fig. 3b. In natural language processing, word embeddings have been widely and successfully used to resolve conflicting and redundant entities in an unsupervised manner [110][111][112] , and deep language models such as BERT 101 can produce semantically meaningful word embeddings. Specifically, ClinicalBERT 113 is a BERT language model that encodes medical notions of semantics because it has been pre-trained on biomedical knowledge from PubMed 114 and discharge summaries from MIMIC-III 115 . We used ClinicalBERT to extract word embeddings for disease group names identified during string matching. We also defined the similarity between two disease names as the cosine distance between their ClinicalBERT embeddings. Then, after applying an empirically chosen cutoff of similarity ≥0.98, we manually approved the suggested disease matches and assigned names to the new groups. Finally, these groupings were applied to the knowledge graph.
Finally, 22,205 disease concepts in MONDO were collapsed into 17,080 grouped diseases, which has resulted in a higher average edge density across diseases and more clinically relevant disease nodes. We anticipate that PrimeKG is a powerful dataset with this grouping because disease representations are dense and robust.
Systematic evaluation of PrimeKG. Given the substantial investment and time required to develop a novel drug, network analysis has long been used to identify opportunities for expanding the use of already available drugs 116 . We demonstrate that PrimeKG can help detect drug repurposing opportunities. For this validation, we systematically retrieved 40 novel therapies approved by the FDA since June 2021. PrimeKG contains information up to 1 June 2021, limiting data leakage from PrimeKG to these therapies.
Of these 40 recent FDA-approved therapies, we identified 11 repurposed drugs in PrimeKG as listed in Table 5 and the remaining were novel compounds. We identified the disease corresponding to the indication for each drug and conducted network proximity calculations between relevant drug-disease pairs. As a first step, we would expect that only a few drug-disease pairs would have direct indication edges between them. However, as shown in Table 5, only one pair has such an edge, confirming that there has been no temporal data leakage.
We conducted network analysis on these pairs by studying the shortest path distance between the repurposed drug and indicated disease. For each drug, we conducted permutation analysis by sampling 1000 non-indicated diseases and calculating the mean randomized shortest path distance with 95% confidence intervals between these (2023) 10:67 | https://doi.org/10.1038/s41597-023-01960-3 www.nature.com/scientificdata www.nature.com/scientificdata/ pairs. Further, we applied a non-parametric statistical analysis that tested the number of times the indicated shortest path distance was greater than the random shortest path distance and applied a Bonferroni correction to obtain the significance. At a threshold of P ≤ 0.05, relevant drugs are much closer to the indicated diseases than expected by random chance in 8 out of 10 cases (Table 5). For example, in the case of familial hypercholesterolemia, one would need to parse through only 3.4% of drugs in PrimeKG to find a positive hit. This analysis demonstrates the utility of PrimeKG for drug repurposing and provides additional evidence that PrimeKG is a high-quality dataset.
The potential uses of PrimeKG are vast. PrimeKG describes drug features on a deeper biological level and disease features on a deeper clinical level, which can be used to explain genotype-phenotype associations in terms of genes, pathways, or any other nodes in an extensive knowledge graph, like PrimeKG. Consequently, PrimeKG can be paired with deep graph neural networks 117 to help identify disease biomarkers, characterize disease processes, hone disease classification, identify phenotypic traits, and repurpose drugs. With the implementation of machine learning functionality, we anticipate that PrimeKG and similar knowledge graphs will become critical tools in advancing precision medicine.    Table 5. PrimeKG can identify drug repurposing opportunities. We retrieved 11 drugs with new indications approved by the FDA in the year after PrimeKG was assembled. We conducted a network proximity analysis between the repurposed drug and its indicated disease. Only 1 of 11 pairs already has an indication edge in PrimeKG, confirming that there is no temporal data leakage. We computer the shortest path distances between repurposed drug and (i). indicated disease; and (ii). a sample of 1000 non-indicated diseases. For the later, we have reported the mean randomized shortest path distance with 95% confidence intervals. We applied a nonparametric statistical analysis to assess how often the indicated shortest path distance is greater than the random shortest path distance and applied a Bonferroni correction to obtain the significance. At a threshold of P ≤ 0.05, relevant drugs are much closer to the indicated diseases than expected by random chance in 8 out of 10 cases. This analysis demonstrates the utility of PrimeKG for drug repurposing.